Life expectancy has increased steadily over the last centuries and has never been as high as it is today. Medicine has taken great strides forward, eradicating or finding cures for many deadly diseases. Even so, medicine still faces major challenges, and cancer is certainly one of them. Cancer is a leading cause of death worldwide, accounting for almost 10 million deaths in 2020. For this reason we thought it would be interesting to analyze the principal factors that, according to state-of-the-art scientific research, cause this disease. We also divided these factors into three macro-groups of variables: environmental, lifestyle-related and socio-economic/demographic.
Our analysis, which is purely statistical, aims to verify whether data on the selected risk factors can be used to predict the share of people with cancer in a given country and year. Moreover, we tried to improve on our all-inclusive model by searching for the best model that can be built from the above-mentioned factors.
Our research started by looking up the leading causes of cancer on reliable websites and dividing them into the macro-categories. For this we relied on the official WHO website, and we selected the following risk factors:
(IMPORTANT DISCLAIMER: of course the factors listed under “Socio-Economical/Demographic” cannot be considered causes of cancer, but as we will see they turn out to be very useful for modelling cancer across the world and over time.)
We built our dataset by merging different csv files, all taken from the same source, “Our World in Data”. This site itself collects datasets from verified sources, and the precise reference for each factor is given in the sitography at the end of the paper. Here we provide a complete description of the variables we decided to use in our model.
Cancer: Share of total population with any form of cancer, measured as the age-standardized percentage. This share has been age-standardized assuming a constant age structure to compare prevalence between countries and through time.
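To make the idea of age-standardization concrete, here is a minimal sketch of direct standardization in R. The age bands, prevalence values and reference weights are invented for illustration and are not taken from Our World in Data; the procedure and reference population actually used by the source may differ.

```r
# Hypothetical age-specific cancer prevalence (%), identical in both countries
prev_by_age  <- c(0.05, 0.40, 2.50, 6.00)   # bands: 0-14, 15-49, 50-69, 70+

# Actual age structure of each country (shares sum to 1, made-up values)
struct_young <- c(0.40, 0.45, 0.12, 0.03)   # country with a young population
struct_old   <- c(0.15, 0.40, 0.28, 0.17)   # country with an old population

# Assumed constant reference age structure used for standardization
reference    <- c(0.25, 0.45, 0.20, 0.10)

# Crude prevalence depends on each country's own age structure
crude_young <- sum(prev_by_age * struct_young)
crude_old   <- sum(prev_by_age * struct_old)

# Age-standardized prevalence applies the same reference weights to both
std_young <- sum(prev_by_age * reference)
std_old   <- sum(prev_by_age * reference)
```

In this toy case the crude shares differ only because the older country has more people in the high-prevalence age bands, while the standardized shares coincide. This is the sense in which standardization makes countries comparable.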
Air_pollution: Population-weighted average level of exposure to concentrations of suspended particles measuring less than 2.5 microns in diameter (PM2.5). Exposure is measured in micrograms of PM2.5 per cubic meter (µg/m³).
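As a small illustration of what "population-weighted" means here, the sketch below compares a simple mean with a population-weighted mean. The three regions and their values are hypothetical; the source's own calculation is done on much finer geographic data.

```r
# Hypothetical regions within one country (made-up values)
pm25       <- c(10, 35, 60)      # mean PM2.5 concentration, µg/m³
population <- c(5e6, 3e6, 2e6)   # residents per region

# The simple mean treats every region equally
simple_mean <- mean(pm25)

# The weighted mean counts each region in proportion to its residents
weighted_exposure <- weighted.mean(pm25, w = population)
```

Because most people in this toy country live in the cleanest region, the weighted exposure is lower than the simple mean.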
Alcohol: Average per capita consumption of alcoholic beverages, measured in kilograms per year. Data is based on per capita food supply at the consumer level, but does not account for food waste at the consumer level.
GDP per capita: Measured in constant international-$.
Obesity: Obesity is defined as having a body-mass index (BMI) equal to or greater than 30. BMI is a person’s weight in kilograms divided by his or her height in metres squared
Old age Dependency: This is the ratio of the number of people older than 64 relative to the number of people in the working-age (15-64 years). Data are shown as the proportion of dependents per 100 working-age population.
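The ratio defined above can be written out as a one-line calculation. The population counts below are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical population counts by age group (made-up values)
pop_65_plus <- 2.4e6   # people older than 64
pop_15_64   <- 8.0e6   # working-age people (15-64 years)

# Dependents per 100 working-age people
old_age_dependency <- 100 * pop_65_plus / pop_15_64
```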
Smoking: Share of population who smoke every day
We started by loading all the different csv files. Then we took the columns we were interested in and joined them together. After some further minor manipulation (see the R code for that), here is what part of the final table looks like (the complete table, including the N.A.s, has dimension 6468x11):
##We start by loading some useful libraries that we will need during this process
library(readr) #Used to read the csv
library(tidyverse) #Contains a lot of useful packages to clean the data
library(gganimate) #To create animated plots
library(ggthemes) #To select some nice themes
library(zoo) #To fill the N.A. values
library(grid) #To display plots into a grid
library(gridExtra) #To display plots into a grid
library(ggpubr)#To display plots into a grid
##We now load the raw csv files that we need to merge
Cancer <- read_csv("Datasets/Cancer.csv")
Air_pollution <- read_csv("Datasets/Air pollution.csv")
Alcool <- read_csv("Datasets/Alcool.csv")
GDP_per_capta <- read_csv("Datasets/GDP per capta.csv")
Obesity <- read_csv("Datasets/Obesity.csv")
Old_age_dependency_ratio <- read_csv("Datasets/Old age dependency ratio.csv")
Smoking <- read_csv("Datasets/Smoking.csv")
Population <- read_csv("Datasets/population-since-1800.csv")
countryContinent <- read_csv("Datasets/countryContinent.csv")
#We proceed by taking the response variable data set (Cancer) and adding the covariates as other columns; to do that we use the library "dplyr", as it makes this very easy to do and readable
Continent = countryContinent %>% select(country, continent)
Continentt = rename(Continent, Entity = country)
Cancer1 = left_join(Cancer, Air_pollution, by = c("Entity", "Code", "Year"))
Cancer2 = left_join(Cancer1, Alcool, by = c("Entity", "Code", "Year"))
Cancer3 = left_join(Cancer2, GDP_per_capta, by = c("Entity", "Code", "Year"))
Cancer4 = left_join(Cancer3, Obesity, by = c("Entity", "Code", "Year"))
Cancer5 = left_join(Cancer4, Old_age_dependency_ratio, by = c("Entity", "Code", "Year"))
Cancer6 = left_join(Cancer5, Smoking, by = c("Entity", "Code", "Year"))
Cancer7 = left_join(Cancer6, Continentt, by = "Entity")
Cancer_final = Cancer7
rm("Cancer1", "Cancer2", "Cancer3", "Cancer4", "Cancer5", "Cancer6", "Cancer7")
rm("Air_pollution", "Alcool", "Cancer", "GDP_per_capta", "Obesity", "Old_age_dependency_ratio", "Smoking", "Continent", "Continentt", "countryContinent")
#We now proceed by renaming the columns. This will make all the manipulations of the dataframe more readable:
colnames(Cancer_final)[4] <- "Cancer"
colnames(Cancer_final)[5] <- "AirPoll"
colnames(Cancer_final)[6] <- "Alcool"
colnames(Cancer_final)[7] <- "GDP"
colnames(Cancer_final)[8] <- "Obesity"
colnames(Cancer_final)[9] <- "Age"
colnames(Cancer_final)[10] <- "Smoking"
Cancer_final
write.csv(Cancer_final, "Cancer_final.csv", row.names=FALSE, quote=FALSE) #to save the table as a csv file in the working directory
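The chain of intermediate tables (Cancer1, Cancer2, ...) can also be written as a single fold over a list of data frames. The sketch below is an alternative formulation, not the code used for the report; it runs on small made-up tables with the same key columns, and it also counts the missing values per column, which helps to see how much of the merged table is N.A.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for the real csv files (all values are made up)
keys <- c("Entity", "Code", "Year")
cancer_toy  <- tibble(Entity = c("Italy", "Italy", "Japan"),
                      Code   = c("ITA", "ITA", "JPN"),
                      Year   = c(2016, 2017, 2017),
                      Cancer = c(1.5, 1.6, 1.4))
smoking_toy <- tibble(Entity = "Italy", Code = "ITA", Year = 2017, Smoking = 20)
gdp_toy     <- tibble(Entity = c("Italy", "Japan"), Code = c("ITA", "JPN"),
                      Year = c(2017, 2017), GDP = c(35000, 39000))

# Left-join every covariate table onto the Cancer table, one after the other
merged_toy <- reduce(list(cancer_toy, smoking_toy, gdp_toy),
                     function(x, y) left_join(x, y, by = keys))

# Number of missing values per column
na_per_column <- colSums(is.na(merged_toy))
```

Because every join is a left join starting from the Cancer table, the number of rows stays that of the Cancer table and unmatched covariates simply become N.A.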
Now we will proceed with some plotting. Again, for the detailed process of making these graphs please refer to the R script. You may also want to check our file in HTML format, in which we made some animations to display the evolution of the variables over time.
We start by displaying the variable we are interested in explaining, that is, the share of the population with any form of cancer. We chose to plot the most recent year in our data frame, 2017. As the plot below clearly shows, the countries with the highest share of people with cancer are the USA, Canada, Australia and Greenland. These are followed by European countries. Asian and African countries have relatively lower shares. A big downside of this picture is that countries differ in their screening capacity. As the WHO writes, “Late-stage presentation and lack of access to diagnosis and treatment are common, particularly in low- and middle-income countries”. We will discuss this limitation further in the dedicated section. (Countries for which data were not available are omitted from the map, but luckily those are few.)
#We read the file needed
share_of_population_with_cancer <- read_csv("Cancer.csv") #read the file
share_of_population_with_cancer = data.frame(lapply(share_of_population_with_cancer, function(x) {
gsub("United States", "USA", x) #correction needed for the mapping
}))
colnames(share_of_population_with_cancer)[1] <- "region" #Give the same name for the country in the two datasets
colnames(share_of_population_with_cancer)[4] <- "cancer" #Dropping the complicated name given by the data frame
# Create the actual map that we will use
#We will use the ggplot2 package for that
mapdata = map_data("world") #Used to take the latitude and longitude of the countries
mapp <- mapdata %>%
left_join(share_of_population_with_cancer, by = "region") #merge the two datasets
mapp2 = mapp %>%
filter(!is.na(cancer)) #dealing with NA
mapp2$cancer <- as.numeric(mapp2$cancer) #changing the value to numeric for plotting
mapp2$Year <- as.numeric(mapp2$Year) #changing the value to numeric for animation
# Building the map
#Here we create the animated map (one frame per year) and store it in graphMap
graphMap = ggplot( data = mapp2, aes(long, lat, group = subregion)) +
geom_map(
aes(map_id = region),
map = mapdata,
color = "black", fill = "gray30", size = 0.3
) +
geom_polygon(aes(group = group, fill = cancer), color = "black") +
scale_fill_gradient2(low = "blue", high = "red" , midpoint = 1.0355583) +
theme_minimal() +
transition_time(Year) +
labs(
title = "Share of population with cancer in {round(frame_time)}",
x = "", y = "", fill = ""
) +
theme_bw() +
shadow_wake(wake_length = 0.05)
#We animate the map object itself (not graphPoll, which is only defined later)
animate(graphMap, height = 500, width = 800, fps = 30, duration = 15, end_pause = 60, res = 100)
anim_save("Share with cancer world.gif")
Note: the figure shown is not the file you would get by running the code, but its static version. The computer used for this report does not have enough computational power to render the animation and shuts down during the attempt.
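A static map for a single year can be drawn without gganimate by filtering the data before plotting. The helper below, `static_cancer_map`, is a hypothetical sketch and not the code that produced the figure; it assumes a data frame with the columns long, lat, group, cancer and Year, like mapp2, and is demonstrated on two made-up square "countries" so that it runs without the map data.

```r
library(ggplot2)
library(dplyr)

# Hypothetical helper: draw the map for one year only.
# `map_df` must contain the columns long, lat, group, cancer and Year.
static_cancer_map <- function(map_df, year) {
  map_df %>%
    filter(Year == year) %>%
    ggplot(aes(long, lat, group = group, fill = cancer)) +
    geom_polygon(color = "black") +
    scale_fill_gradient(low = "blue", high = "red") +
    labs(title = paste("Share of population with cancer in", year),
         x = "", y = "", fill = "") +
    theme_bw()
}

# Toy data: two unit squares observed in two years (made-up values)
square <- function(x0, id, year, value) {
  data.frame(long = x0 + c(0, 1, 1, 0), lat = c(0, 0, 1, 1),
             group = id, cancer = value, Year = year)
}
toy_map <- bind_rows(square(0, 1, 2016, 0.8), square(2, 2, 2016, 1.5),
                     square(0, 1, 2017, 0.9), square(2, 2, 2017, 1.6))

map_2017 <- static_cancer_map(toy_map, 2017)
```

Rendering a single frame this way is much lighter than building the full animation, since only one year of polygons is drawn.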
Now we explore the dataset by plotting each selected factor against Cancer, to get a first idea of possible relationships. Because we selected them as “risk factors”, we would expect to see some positive correlation. To be consistent with the above, we again plot the graphs for the year 2017. We also used some libraries to make the plots nicer, coloring the points by continent and scaling their size by the population of the country. The result is the following:
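The expected positive relationship can also be checked numerically before looking at the plots. The sketch below computes the Pearson correlation of two factors with Cancer for one year; the data frame is a made-up stand-in that only borrows the column names of Cancer_final, so the numbers say nothing about the real data.

```r
library(dplyr)

# Toy data with the same column names as Cancer_final (values are made up)
toy <- tibble(
  Entity  = c("A", "B", "C", "D", "E"),
  Year    = 2017,
  Cancer  = c(0.6, 0.9, 1.2, 1.5, 1.9),
  GDP     = c(2000, 9000, 15000, 30000, 52000),
  Smoking = c(12, 25, 18, 22, 15)
)

# Pearson correlation of each factor with Cancer, using complete pairs only
cor_2017 <- toy %>%
  filter(Year == 2017) %>%
  summarise(across(c(GDP, Smoking),
                   ~ cor(.x, Cancer, use = "pairwise.complete.obs")))
```

A correlation computed this way only describes a linear association across countries in one year and does not by itself say anything about causation.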
#Here we fill the N.A.s with the closest available value (last observation carried forward); this is needed for plotting only
#and we will not do any analysis with these imputed data
#na.rm = FALSE keeps every vector at its original length, so the assignment does not fail when a column starts (or, with fromLast, ends) with N.A.
Cancergraph = left_join(Cancer_final, Population, by = c("Entity", "Code", "Year"))
Cancergraph$Smoking = na.locf(Cancergraph$Smoking, na.rm = FALSE)
Cancergraph$AirPoll = na.locf(Cancergraph$AirPoll, na.rm = FALSE)
Cancergraph$Alcool = na.locf(Cancergraph$Alcool, na.rm = FALSE)
Cancergraph$GDP = na.locf(Cancergraph$GDP, fromLast = TRUE, na.rm = FALSE)
Cancergraph$Age = na.locf(Cancergraph$Age, na.rm = FALSE)
Cancergraph$Obesity = na.locf(Cancergraph$Obesity, na.rm = FALSE)
Cancergraph$`Population (historical estimates)` = na.locf(Cancergraph$`Population (historical estimates)`, na.rm = FALSE)
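One thing to keep in mind with this filling step is that, when the rows of all countries are stacked in one column, carrying the last observation forward can copy a value from one country into the next. A possible alternative, sketched here on made-up data and not used in the report, is to fill within each country using `tidyr::fill` after grouping by Entity.

```r
library(dplyr)
library(tidyr)

# Toy panel stacked by country, like Cancergraph (values are made up)
toy <- tibble(
  Entity  = c("A", "A", "A", "B", "B", "B"),
  Year    = c(2015, 2016, 2017, 2015, 2016, 2017),
  Smoking = c(20, NA, 22, NA, 30, NA)
)

# Fill within each country: carry the last value forward, then fill any
# leading gap backward from the next available value of the same country
filled <- toy %>%
  group_by(Entity) %>%
  fill(Smoking, .direction = "downup") %>%
  ungroup()
```

In this toy panel the first value of country B is taken from B's own 2016 observation rather than from the last observation of country A.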
#Here we create and save each plot. This chunk of code is not evaluated every time because it is pretty heavy and time/CPU consuming. The results are included at the end
graphPoll = Cancergraph %>%
ggplot(aes(x = AirPoll, y = Cancer, color = continent, size = `Population (historical estimates)`)) +
geom_point(alpha = 0.9, stroke = 0) +
scale_size(range = c(2,12), guide = "none") +
scale_color_brewer(palette = "Set2") +
labs(title = "Cancer prevalence vs Air Pollution",
x = "Air pollution exposure",
y = "Share of people with Cancer",
color = "Continent",
caption = "Source: Our World in Data") +
transition_time(Year) +
labs (subtitle = "Year:{round(frame_time)}") +
shadow_wake(wake_length = 0.05)
animate(graphPoll, height = 500, width = 800, fps = 30, duration = 15, end_pause = 60, res = 100)
anim_save("Airpoll Visual Animation.gif")
graphAlco = Cancergraph %>%
ggplot(aes(x = Alcool, y = Cancer, color = continent, size = `Population (historical estimates)`)) +
geom_point(alpha = 0.9, stroke = 0) +
scale_size(range = c(2,12), guide = "none") +
scale_color_brewer(palette = "Set2") +
labs(title = "Cancer prevalence vs Alcohol consumption",
x = "Alcohol consumed",
y = "Share of people with Cancer",
color = "Continent",
caption = "Source: Our World in Data") +
transition_time(Year) +
labs (subtitle = "Year:{round(frame_time)}") +
shadow_wake(wake_length = 0.05)
animate(graphAlco, height = 500, width = 800, fps = 30, duration = 15, end_pause = 60, res = 100)
anim_save("Alcool Visual Animation.gif")
graphGDP = Cancergraph %>%
ggplot(aes(x = GDP, y = Cancer, color = continent, size = `Population (historical estimates)`)) +
geom_point(alpha = 0.9, stroke = 0) +
scale_size(range = c(2,12), guide = "none") +
scale_color_brewer(palette = "Set2") +
labs(title = "Cancer prevalence vs GDP",
x = "GDP per capita",
y = "Share of people with Cancer",
color = "Continent",
caption = "Source: Our World in Data") +
transition_time(Year) +
labs (subtitle = "Year:{round(frame_time)}") +
shadow_wake(wake_length = 0.05)
animate(graphGDP, height = 500, width = 800, fps = 30, duration = 15, end_pause = 60, res = 100)
anim_save("GDP Visual Animation.gif")
graphObes = Cancergraph %>%
ggplot(aes(x = Obesity, y = Cancer, color = continent, size = `Population (historical estimates)`)) +
geom_point(alpha = 0.9, stroke = 0) +
scale_size(range = c(2,12), guide = "none") +
scale_color_brewer(palette = "Set2") +
labs(title = "Cancer prevalence vs Obesity",
x = "Share of Obese people",
y = "Share of people with Cancer",
color = "Continent",
caption = "Source: Our World in Data") +
transition_time(Year) +
labs (subtitle = "Year:{round(frame_time)}") +
shadow_wake(wake_length = 0.05)
animate(graphObes, height = 500, width = 800, fps = 30, duration = 15, end_pause = 60, res = 100)
anim_save("Obesity Visual Animation.gif")
graphAge = Cancergraph %>%
ggplot(aes(x = Age, y = Cancer, color = continent, size = `Population (historical estimates)`)) +
geom_point(alpha = 0.9, stroke = 0) +
scale_size(range = c(2,12), guide = "none") +
scale_color_brewer(palette = "Set2") +
labs(title = "Cancer prevalence vs Age",
x = "Old-age dependency ratio",
y = "Share of people with Cancer",
color = "Continent",
caption = "Source: Our World in Data") +
transition_time(Year) +
labs (subtitle = "Year:{round(frame_time)}") +
shadow_wake(wake_length = 0.05)
animate(graphAge, height = 500, width = 800, fps = 30, duration = 15, end_pause = 60, res = 100)
anim_save("Age Visual Animation.gif")
graphSmoke = Cancergraph %>%
ggplot(aes(x = Smoking, y = Cancer, color = continent, size = `Population (historical estimates)`)) +
geom_point(alpha = 0.9, stroke = 0) +
scale_size(range = c(2,12), guide = "none") +
scale_color_brewer(palette = "Set2") +
labs(title = "Cancer prevalence vs Smoking",
x = "Share of daily smokers",
y = "Share of people with Cancer",
color = "Continent",
caption = "Source: Our World in Data") +
transition_time(Year) +
labs (subtitle = "Year:{round(frame_time)}") +
shadow_wake(wake_length = 0.05)
animate(graphSmoke, height = 500, width = 800, fps = 30, duration = 15, end_pause = 60, res = 100)
anim_save("Smoking Visual Animation 3.gif")
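The six plots above differ only in the x variable and in the labels, so the shared layers could be collected in one function. The helper `cancer_scatter` below is a hypothetical refactoring, not part of the original script; it builds only the static layers and is shown on a two-row made-up data frame so that it runs on its own.

```r
library(ggplot2)

# Hypothetical helper that builds the static layers shared by all six plots.
# `xvar` and `pop_col` are column names given as strings.
cancer_scatter <- function(df, xvar, title, xlab,
                           pop_col = "Population (historical estimates)") {
  ggplot(df, aes(x = .data[[xvar]], y = Cancer, color = continent,
                 size = .data[[pop_col]])) +
    geom_point(alpha = 0.9, stroke = 0) +
    scale_size(range = c(2, 12), guide = "none") +
    scale_color_brewer(palette = "Set2") +
    labs(title = title, x = xlab, y = "Share of people with Cancer",
         color = "Continent", caption = "Source: Our World in Data")
}

# Toy data (made-up values) just to show the call
toy <- data.frame(Cancer = c(0.8, 1.6), Obesity = c(10, 30),
                  continent = c("Asia", "Europe"), Year = c(2017, 2017))
toy[["Population (historical estimates)"]] <- c(5e7, 6e7)

p_obesity <- cancer_scatter(toy, "Obesity", "Cancer vs Obesity",
                            "Share of Obese people")
```

With gganimate loaded, the animation layers used in the script (`transition_time(Year)`, the subtitle with `frame_time`, and `shadow_wake()`) could then be added to the returned object with `+` before calling `animate()`.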
#And finally here are the animations:
knitr::include_graphics("Airpoll Visual Animation.gif")
knitr::include_graphics("Alcool Visual Animation.gif")
knitr::include_graphics("GDP Visual Animation.gif")
knitr::include_graphics("Obesity Visual Animation.gif")